Pattern Searching Methods and Apparatuses

ABSTRACT

A computer-based method for identifying patterns in computer text using structures defining types of patterns which are to be identified, wherein a structure comprises one or more definition items, the method comprising assigning a weighting to each structure and each definition item; searching the computer text for a pattern to be identified on the basis of a particular structure, a pattern being provisionally identified if it matches the definition given by said particular structure; in a provisionally identified pattern, determining those of the definition items making up said particular structure that have been identified in the provisionally identified pattern; combining the weightings of the determined definition items and optionally, the weighting of the particular structure, to a single quantity; assessing whether the single quantity fulfils a given condition; depending on the result of said assessment, rejecting or confirming the provisionally identified pattern.

This application is a continuation of co-pending U.S. patent application Ser. No. 11/710,182 filed on Feb. 23, 2007.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates generally to a system and method for extracting relevant information from raw text data. More particularly, the invention concerns itself with a system and method for identifying patterns in text using structures defining types of patterns. In this context a “pattern” is to be understood as a part of a written text of arbitrary length. Thus, a pattern may be any series of alphanumeric characters within a text. Particular examples of patterns that might be identified in a text, such as a word-processor document or an email text, are dates, events, numbers such as telephone numbers, addresses or names.

2. Description of the Background Art

Technologies for searching for interesting patterns in a text presented by a computer to a user (in the following “computer text”) are well known. U.S. Pat. No. 5,864,789 is one example of a document describing such a technology.

A system that searches for patterns in a computer text and offers the user actions based on the kind of patterns identified is described in two variants under http://www.miramontes.com/portfolio/add/ and http://www.miramontes.com/portfolio/add/add2.html. The first variant is an application termed “AppleDataDetectors” and the second variant an application termed “LiveDoc”.

Both variants use the same method to find patterns in an unstructured text. The engine performing the pattern search refers to a library containing a collection of structures, each structure defining a pattern that is to be recognized. FIG. 1 gives an example of seven different structures (#1 to #7), which may be contained in such a structure library. Each of the seven structures shown in FIG. 1 defines a pattern worth recognizing in a computer text. The definition of a pattern is a sequence of so-called definition items. Each definition item specifies an element of the text pattern that the structure recognizes. A definition item may be a specific string or a structure defining another pattern using definition items in the form of strings or structures. For example, structure #1 gives the definition of what is to be identified as a US state code, the definition following the “:=” sign. According to this definition, a pattern in a text will be identified as a US state code if it corresponds to one of the strings between quotation marks, i.e. one of the definition items, such as AL or AK or WY (note that the symbol “|” means “OR”).

The structure #7 gives a definition of what is to be identified as a street address. In this context, a street address is to be understood as a postal address excluding the name of the recipient. A typical example of a street address is: 225 Franklin Street, 02110 MA Boston. According to the definition given by structure #7, a pattern is a street address if it has elements matching the following sequence of definition items:

-   a number in the sense as defined by structure #4, followed by
-   some spaces, followed by
-   some capitalized words, followed by,
-   optionally, a known street type in the sense as defined by structure #5 (the optional nature being indicated by the question mark behind the brackets surrounding “known_street_type”), followed by
-   a comma or spaces, followed by,
-   optionally, a postal code in the sense as defined by structure #3, followed by
-   some spaces, followed by
-   a city in the sense as defined by structure #6.

This definition of a street address is deliberately broad in order to ensure that the application is able to identify not only street addresses written according to a single specific notation but also addresses written according to differing notations.

However, an application using such a broad definition is prone to the detection of a large number of false positives. For example, with the definition of a street address given above, the pattern “4 Apple Pies” will be wrongly recognized as a street address. The obvious solution to reduce the number of false positives is to make the structure definitions narrower. Yet, with narrow definitions there is an increased risk of missing interesting patterns.
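The false positive described above can be reproduced with a minimal sketch of conventional, unweighted matching. The regular expressions below are hypothetical, heavily simplified stand-ins for structures #3 to #7 of FIG. 1 (the figure itself is not reproduced here), and the function name `find_street_address` is likewise an assumption made for illustration.

```python
import re

# Hypothetical, simplified stand-ins for the definition items of structure #7.
NUMBER = r"\d+"
CAPITALIZED_WORDS = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*?"
KNOWN_STREET_TYPE = r"(?:Street|Boulevard|Road)"
POSTAL_CODE = r"\d{5} [A-Z]{2}"
CITY = r"[A-Z][a-z]+"

# number, spaces, capitalized words, optional street type,
# comma or spaces, optional postal code, spaces, city
STREET_ADDRESS = re.compile(
    NUMBER + r" +" + CAPITALIZED_WORDS
    + r"(?: " + KNOWN_STREET_TYPE + r")?"
    + r"(?:, *| +)"
    + r"(?:" + POSTAL_CODE + r" +)?"
    + CITY
)


def find_street_address(text):
    """Return the first substring matching the broad definition, or None."""
    match = STREET_ADDRESS.search(text)
    return match.group(0) if match else None


print(find_street_address("Our offices are located at 225 Franklin Street, 02110 MA Boston"))
print(find_street_address("The boys ate 4 Apple Pies"))
```

Under these assumptions both calls return a match, so the broad definition alone cannot tell the real address from “4 Apple Pies”.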

At least certain embodiments of the present invention provide a method and system for identifying patterns in text using structures, which increase the flexibility of structure definitions and which, in particular, permit the formulation of structure definitions that lead to more accurate results during pattern identification.

SUMMARY OF THE DESCRIPTION

In one embodiment, a computer-based method identifies patterns in text using structures defining types of patterns which are to be identified, wherein a structure comprises one or more definition items. The method includes: assigning a weighting to each structure and each definition item; searching the text for a pattern to be identified on the basis of a particular structure, a pattern being provisionally identified if it matches the definition given by said particular structure; in a provisionally identified pattern, determining those of the definition items making up said particular structure that have been identified in the provisionally identified pattern; combining the weightings of the determined definition items and, optionally, the weighting of the particular structure, into a single quantity; assessing whether the single quantity fulfils a given condition; and, depending on the result of said assessment, rejecting or confirming the provisionally identified pattern.

Through the introduction of weightings for each structure and definition item, pattern definition and identification become more flexible and accurate. Indeed, in contrast to the conventional method of pattern identification, at least certain embodiments of a method of the invention introduce a supplementary test for the identification of patterns. For a pattern to be recognized, it is no longer sufficient that it matches the definition of the corresponding structure. On top of that, at least certain embodiments of the invention use a second procedure which consists in performing a sort of plausibility check. The weightings of the definition items of the relevant structure that have been matched to the elements of the provisionally identified pattern must in combination fulfill a given condition. If this is the case, it is assumed that the identified pattern is sufficiently likely to really correspond to the relevant structure (e.g., if the structure defines telephone numbers, when the given condition is met by the combined weightings, it is assumed that the identified pattern is indeed a telephone number and not a false positive).

The introduction of weightings and of a plausibility test based on those weightings allows for structures with broad pattern definitions without the risk of an overly high number of false positives. A structure having a broad definition will lead to a lot of incorrect matches. However, these false positives may then be “sieved out” with the described “plausibility test” based on the assigned weightings. The weightings are assigned to the structures and definition items such that the combined weightings of a false positive are very unlikely to fulfill the given condition. The use of weightings gives more flexibility and freedom in the definition of structures and definition items.

A machine-implemented method is a method which is preferably implemented via a data processing system such as a computer. The term “computer” includes any data processing system such as any computing device as, for example, a desktop computer, laptop, personal digital assistant, mobile phone, multimedia device, notebook, or other consumer electronic devices and similar devices.

In the present context, a weighting is a quantity used to emphasize, to suppress or even to penalize a structure or definition item associated with it. A structure with a greater weighting is considered to be more desirable or more accurate than a structure with a lower, no or even a negative weighting. Preferably, the weighting is a number and in particular an integer. In the latter case, each weighting may take the form of either a bonus in the form of a positive integer, or a malus in the form of a negative integer. Within the context of the invention, the term “malus” is to be understood as being the antonym of the term “bonus”. A “malus” may also be qualified as a penalty.

A bonus may be assigned to a structure or definition item if it is well-defined, meaning that there is a high probability for correct pattern identification if the identified pattern contains said structure or definition item. A malus or penalty may be assigned if the structure or definition item is ambiguous. This may mean that the structure or definition item allows different interpretations, only one of which leads to correct pattern identification. It may also mean that the structure or definition item defines a set of elements of which only a subset may be contained in the pattern sought-after.

In a preferred embodiment, each weighting is an integer multiple of the same integer. Accordingly, the weightings may be quantized as multiples of a single integer. This renders the weighting scheme of the invention more manageable and easier to implement.

In a most preferred embodiment, the weightings are quantized as multiples of the integer “1”, meaning that the whole integer range is used for the weightings.

Preferably, the given condition corresponds to the single quantity being above or below a given threshold. Furthermore, the single quantity may be obtained by combining the weightings using one or more arithmetic operations, such as addition, subtraction, multiplication and/or division. The most preferred arithmetic operation is a summation over all weightings, the single quantity being the sum of all the weightings.
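The combination and the threshold condition can be sketched as two small functions. This is only an illustration under stated assumptions: the names `single_quantity` and `fulfils_condition`, the default threshold of 0 and the example weightings are all hypothetical, and summation is used as the default because the text names it as the most preferred operation.

```python
import operator
from functools import reduce


def single_quantity(weightings, combine=operator.add, initial=0):
    """Fold the weightings into one number; summation is the default."""
    return reduce(combine, weightings, initial)


def fulfils_condition(quantity, threshold=0, above=True):
    """The condition is the quantity lying above (or below) a threshold."""
    return quantity > threshold if above else quantity < threshold


# Hypothetical weightings: three bonuses of +5 and one malus of -10.
total = single_quantity([5, 5, 5, -10])
print(total, fulfils_condition(total))
```

Passing `operator.mul` with `initial=1` would combine the weightings by multiplication instead, which is one of the other arithmetic operations the text allows.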

In a further aspect of the invention, which may also be implemented independently from the inventive weighting scheme described above, the structures are automatically generated or extended on the basis of information available from a data source, such as a calendar application or an address book application. For example, a structure defining the pattern “city name” may be automatically completed by the system with the help of city names fetched from an address book application containing postal addresses of user contacts or from another source of city names such as a locally stored (or remotely stored) database which includes city names. Each time a new contact is added to the address book, the corresponding city name may be automatically added to the structure “city name”. This feature, which may be termed “automatic learning system”, leads to an automatic increase in the knowledge base of known patterns and an automatic improvement of pattern detection as the system learns more and more from the data sources of the user. In particular, thanks to this “automatic learning” feature, there is less need for a programmer or user to actively administer and update the structures and definition items as this is done “on the fly” by the system itself.
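One way to picture this “automatic learning” is a structure that subscribes to an address book and adds each new contact's city to its definition items. The sketch below is an assumption-laden illustration: the classes `Structure` and `AddressBook`, the callback mechanism and the sample contact are hypothetical and do not describe any particular address book application.

```python
class Structure:
    """A structure whose definition items are plain strings."""

    def __init__(self, name, items=()):
        self.name = name
        self.items = set(items)

    def matches(self, text):
        return text in self.items


class AddressBook:
    """Hypothetical data source that notifies listeners of new contacts."""

    def __init__(self):
        self.contacts = []
        self._listeners = []

    def on_contact_added(self, listener):
        self._listeners.append(listener)

    def add_contact(self, name, city):
        contact = {"name": name, "city": city}
        self.contacts.append(contact)
        for listener in self._listeners:
            listener(contact)


city_name = Structure("city name", {"Boston"})
address_book = AddressBook()
# Extend the structure "on the fly" whenever a contact is added.
address_book.on_contact_added(lambda contact: city_name.items.add(contact["city"]))
```

After `address_book.add_contact("Alex", "Cupertino")`, the call `city_name.matches("Cupertino")` succeeds without anyone editing the structure by hand.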

In yet a further aspect of the invention, which may as well be implemented independently from the inventive weighting scheme described above, the computer text is indexed using the patterns identified in it in order to improve search capabilities of computer texts. This means that interesting patterns that have been found in a text using the inventive or any other pattern identification method may be used to tag the text with corresponding metadata. In this way, any computer text can be flagged with all the patterns that have been identified in it. This type of text indexing may be used for more advanced searches in a desktop search application such as “Spotlight” from Apple Inc. of Cupertino, Calif. For example, thanks to the new metadata represented by the identified patterns, one may query all the texts that contain a date within a certain range or that contain a street address near a given city.
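The indexing idea can be sketched as a small in-memory index keyed by pattern type. The class `PatternIndex`, its method names, the document identifiers and the sample values are hypothetical; the sketch does not reflect how any real desktop search application stores metadata.

```python
from collections import defaultdict
from datetime import date


class PatternIndex:
    """Maps each pattern type to the (value, document id) pairs tagged with it."""

    def __init__(self):
        self._entries = defaultdict(list)

    def tag(self, doc_id, pattern_type, value):
        self._entries[pattern_type].append((value, doc_id))

    def documents_with_date_between(self, start, end):
        """Return the ids of all texts containing a date in [start, end]."""
        return sorted(
            {doc_id for value, doc_id in self._entries["date"] if start <= value <= end}
        )


index = PatternIndex()
index.tag("mail-1", "date", date(2007, 2, 24))
index.tag("mail-1", "phone_number", "555-0100")
index.tag("note-2", "date", date(2007, 6, 1))
print(index.documents_with_date_between(date(2007, 2, 1), date(2007, 2, 28)))
```

A query for street addresses near a given city would need geographic data in addition to the tags and is left out of this sketch.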

The inventive methods may be implemented in a computer-based system operable to execute said methods, the term “computer-based system” including any data processing system such as any computing device as, for example, a desktop computer, laptop, personal digital assistant, mobile phone, multimedia device, notebook, or other consumer electronic devices and similar devices. In a typical embodiment, a data processing system includes one or more processors which are coupled to memory and to one or more buses. The processor or processors are also typically coupled to input/output devices through the one or more buses. Examples of data processing systems are shown and described in U.S. Pat. No. 6,222,549, which is hereby incorporated herein by reference.

The inventive methods may also be implemented as a program storage medium having a program stored therein for causing a computer or other data processing system to execute said inventive methods. A program storage medium may be a hard disk drive, a USB stick, a CD, a DVD, a magnetic disk, a Read-Only Memory (ROM), or any other computer storage means.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, a preferred embodiment of the invention will be described, with reference to the accompanying drawings, in which:

FIG. 1 is a listing showing examples of conventional structure definitions;

FIG. 2 is a block diagram showing the main elements of the preferred embodiment of the inventive pattern identification system;

FIG. 3 is a flow chart illustrating the main operations of a preferred pattern detection application, as seen by the user, implementing the inventive pattern identification method;

FIGS. 4 a and 4 b show a first example of the user experience provided by the pattern detection application of FIG. 3;

FIG. 5 shows a second example of the user experience provided by the pattern detection application of FIG. 3;

FIGS. 6 a to 6 e show a third example of the user experience provided by the pattern detection application of FIG. 3;

FIG. 7 is a listing showing examples of structure definitions according to the invention, in contrast to the conventional definitions of FIG. 1;

FIG. 8 is a flow chart illustrating an embodiment of a pattern identification method using weighted structures and definition items.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 2 gives an overview of the inventive pattern identification system 2 and the way in which the system identifies interesting patterns. The core of the system 2 is the pattern search engine 4, which implements the inventive pattern identification method using weightings.

The engine 4 receives a text 6, which is to be searched for known patterns. This text 6 may be a word processor document or an email message. The text is often encoded in some standards-based format, such as ASCII or Unicode. If system 2 is implemented in a mobile phone, the text 6 may also be an SMS or MMS message. If system 2 is part of an instant messaging application, such as iChat from Apple Inc. of Cupertino, Calif., the text 6 may be a message text received via such an instant messaging application. As a further example, text 6 may also correspond to a web page presented by a web browser, such as Safari from Apple Inc. of Cupertino, Calif. Generally, text 6 may correspond to any text entity presented by a computing device to a user.

The text 6 is searched for patterns by the engine 4 according to structures and rules 8. The structures and rules 8 are formulated according to the inventive pattern identification method using weightings. The search by engine 4 yields a certain number of identified patterns 10. These patterns 10 are then presented to the user of the searched text 6 via user interface 12. For each identified pattern, the user interface 12 may suggest a certain number of actions 14. For example, if the identified pattern is a URL address, the interface 12 may suggest the action “open corresponding web page in a web browser” to the user. If the user selects the suggested action, a corresponding application 16 may be started, such as, in the given example, the web browser.

The suggested actions 14 preferably depend on the context 18 of the application with which the user manipulates the text 6. More specifically, when performing an action 14, the system can take into account the application context 18, such as the type of the application (word processor, email client, . . . ) or the information available through the application (time, date, sender, recipient, reference, . . . ) to tailor the action 14 and make it more useful or “intelligent” to the user.

Of course, the type of suggested actions 14 also depends on the type of the associated pattern. If the recognized pattern is a phone number, other actions will be suggested than if the recognized pattern is a postal address.
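The dependence of the suggested actions on the pattern type can be pictured as a simple lookup table. This is a minimal sketch: the pattern-type keys, the action labels other than the web page action quoted above, and the function `suggest_actions` are hypothetical, and tailoring by application context 18 is deliberately left out.

```python
# Hypothetical action table keyed by pattern type.
ACTIONS_BY_PATTERN_TYPE = {
    "url": ["Open corresponding web page in a web browser"],
    "phone_number": ["Add to address book"],
    "postal_address": ["Add to address book", "Show on a map"],
}


def suggest_actions(pattern_type):
    """Return the actions for a pattern type, or an empty list if none are known."""
    return list(ACTIONS_BY_PATTERN_TYPE.get(pattern_type, []))


print(suggest_actions("url"))
print(suggest_actions("phone_number"))
```

Returning a copy of the list keeps callers from modifying the shared table by accident.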

FIG. 3 gives an example of the process of pattern detection as perceived by the user. Let us assume that a user of a desktop computer is currently manipulating a text document via a word processing application. The word processor presents the text on the screen of the computer (operation 1). While the user manipulates the text, a pattern search engine 4, which, in FIG. 3, is called a “Data Detector Engine”, searches the text for known patterns 20. The search engine 4 preferably includes user data 22 in the structures of known patterns 20, which it may obtain from various data sources including user-relevant information, such as a database of contact details included in an address book application or a database of favorite web pages included in a web browser. Adding user data 22 automatically to the set of identifiable patterns 20 renders the search user-specific and thus more valuable to the user. Furthermore, this automatic addition of user data renders the system adaptive and autonomous, saving the user from having to manually add his data to the set of known patterns.

The pattern search is done in the background without the user noticing it. However, when the user places his mouse pointer over a text element that has been recognized as an interesting pattern having actions associated with it, this text element is visually highlighted to the user (operations 2 and 3 in FIG. 3).

The patterns identified in the text could of course also be highlighted automatically, without the need of a user action. However, it is preferred that the highlighting is only done upon a mouse rollover so that it is less intrusive.

The area highlighted by a mouse rollover includes a small arrow. The user can click on this arrow in order to visualize actions associated with the identified pattern in a contextual menu (operations 4 and 5 in FIG. 3). The user may select one of the suggested actions, which is then executed (operations 6 and 7 in FIG. 3).

FIGS. 4 to 6 give three examples of the process illustrated in FIG. 3, as it is seen by the user on his screen.

In FIGS. 4 a and 4 b, the text is an email message 24 sent by “Alex” to “Paul”. Paul has opened the message 24 in his email client. Once the message has been opened, the pattern search engine automatically scans the text for interesting patterns. In the example of FIGS. 4 a and 4 b the engine has identified two interesting patterns: a telephone number 26 and a fax number 28. These two patterns are only brought to the attention of the user Paul by highlighting when he positions his mouse pointer 30 over the phone or fax number. This situation is shown in FIG. 4 a. Paul may then click on the small arrow 32 at the right hand end of the area highlighting the telephone number 26 in order to open a context menu 34 (cf. FIG. 4 b). The context menu includes several possible actions that the user Paul might want to perform on telephone number 26. For example, Paul may add the telephone number to his address book by choosing the corresponding action 36. If Paul chooses action 36, his address book application will be automatically started, including a new entry with the telephone number 26. Preferably, the system auto-completes the new entry with other relevant data that it can deduce from the email message 24. For example, the system may automatically extract the name of the person associated with phone number 26 from the “From” line 38 of the email message 24. The system may also automatically add the fax number 28 to the new entry. Thus, in the present example, the new address book entry created by executing action 36 will already contain the name, telephone and fax number. Paul may then add the missing information manually.

Action 40, named “Large Type”, allows Paul to obtain a magnified view of the telephone number so that he can read it off the screen easily when dialing.

FIG. 5 shows a second example, again with an email message as the search text. The action being executed in FIG. 5 is the creation of a new entry 50 in an address book based on the address pattern 42 detected in the email message. The detected pattern 42 is made of three elements 44, 46 and 48. The three elements have been identified as a name, a street and a city by the pattern search engine and accordingly have been automatically inserted in the appropriate fields in the new entry 50, as depicted by the arrows. Furthermore, the system has determined that address pattern 42 is not a complete postal address. Indeed, address pattern 42 lacks a country code and a ZIP code. In the example shown in FIG. 5, the system retrieves this missing information from an external database 52. The system queries the database 52 using the information extracted from address pattern 42 (street and city) and database 52 returns the missing country code and ZIP code, as shown by the arrows.

There may be a special highlight in entry 50 to indicate to the user that some fields have been auto-completed.

Of course, the various embodiments of the invention are not limited to this specific example. The system may obtain any kind of supplementary information from any available data source in order to automate and enhance the action initiated by the user.

FIGS. 6 a to 6 e give a third example, again involving an email message. This time, the message contains a pattern indicative of an appointment. The appointment is part of the first sentence of the message, as can be seen from FIG. 6 a. This pattern is identified by the pattern search engine and highlighted as soon as the user places his cursor 30 on the appointment pattern (cf. FIG. 6 b). Clicking on the arrow 32, the user initiates the action “New Calendar Event” associated with the identified pattern (cf. FIG. 6 c). FIG. 6 d shows the new calendar entry 54 that has been automatically created by the system. The pattern search engine has also identified the element 56 “dinner” located next to the appointment pattern 58 as a separate event pattern. Thus, the system is able to identify patterns that are related.

Two patterns might be regarded as related if they are in close proximity to each other in the text. When the user rolls over one of two related patterns, both patterns may be highlighted to express their relatedness.
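A proximity test of this kind could be sketched as a comparison of character offsets. The description does not say how “close proximity” is measured, so the character-gap measure, the limit `MAX_GAP`, the helper names and the sample sentence below are all assumptions made purely for illustration.

```python
MAX_GAP = 10  # hypothetical limit, in characters, for "close proximity"


def span_of(text, fragment):
    """Return the (start, end) offsets of fragment in text, end exclusive."""
    start = text.index(fragment)
    return (start, start + len(fragment))


def are_related(span_a, span_b, max_gap=MAX_GAP):
    """Two patterns are treated as related if the gap between them is small."""
    first, second = sorted([span_a, span_b])
    return second[0] - first[1] <= max_gap


# Hypothetical sentence containing an event pattern and an appointment pattern.
sentence = "Let's have dinner tomorrow at 7:30 p.m."
event = span_of(sentence, "dinner")
appointment = span_of(sentence, "tomorrow at 7:30 p.m.")
print(are_related(event, appointment))
```

A word-based or sentence-based distance would work equally well; the character gap is used here only because it is the simplest measure to compute.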

The information represented by the event pattern 56 is automatically entered in the head line field of the new entry 54, as indicated by the arrow. Furthermore, the date of the meeting 60 is automatically generated on the basis of the appointment pattern 58. As pattern 58 is only a contextual date indication (“tomorrow at 7:30 p.m.”), which needs to be interpreted in the light of the context of the message, the system cannot simply copy pattern 58 into the new entry 54. The system solves this by obtaining the date of the email message from the email client of the user. Knowing the date of the email, the system can infer the exact date of the indication “tomorrow” and enter it into the entry 54. This process of using context information to deduce accurate information from context dependent patterns is visualized in FIG. 6 d by the two arrows and the “Context box”.
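The inference of an exact date from “tomorrow” can be sketched as follows. The function `resolve_appointment`, the regular expression and the sample message date are hypothetical; the sketch handles only the single phrasing quoted above and assumes the message date is supplied by the email client.

```python
import re
from datetime import datetime, time, timedelta

# Hypothetical grammar covering only "tomorrow at H:MM a.m./p.m.".
_TOMORROW = re.compile(r"tomorrow at (\d{1,2}):(\d{2}) (a\.m\.|p\.m\.)")


def resolve_appointment(pattern, message_sent):
    """Turn a contextual date pattern into an absolute datetime, or None."""
    match = _TOMORROW.fullmatch(pattern)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    # Convert the 12-hour clock to 24 hours (12 a.m. -> 0, 12 p.m. -> 12).
    hour = hour % 12 + (12 if match.group(3) == "p.m." else 0)
    day = message_sent.date() + timedelta(days=1)
    return datetime.combine(day, time(hour, minute))


# Assumed message date, chosen only for illustration.
print(resolve_appointment("tomorrow at 7:30 p.m.", datetime(2007, 2, 23, 9, 15)))
```

The date arithmetic is delegated to `timedelta`, so a message sent on the last day of a month resolves to the first day of the next one.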

The new entry 54 may also contain a URL 62 of a special kind that points toward the original email message, allowing the user to return to the email message when viewing entry 54.

FIG. 6 e shows the result of the action “New Calendar Event”: a new event has been created in the user's calendar application.

FIG. 7 shows examples of structure definitions according to the invention. These structures are used by the pattern search engine to recognize interesting patterns. The structures #1, #5, #6 and #7 of FIG. 7 are similar to the conventional ones of FIG. 1, with one major difference. In FIG. 7, each structure #1, #5, #6 and #7 has been given a bonus or weighting 64. This bonus is an integer multiple of 5. Structures #1 and #5 have each been given a bonus of +5 whereas structure #7 has been given a bonus of −10 (i.e. a malus). Within structure #6, the first of the two definition items (“known city”) has been given a bonus of +5.

Structures #1, #5 and the structure “known city” have been given a positive bonus because their respective definitions are rather precise, meaning that a pattern matching the definition is highly likely to be of the type defined by the structure. For example, structure #5 is a simple enumeration of strings which are known to represent streets, such as “Street” or “Boulevard” or “Road”. There is a high probability that a pattern in a text that corresponds to such a string is indeed of the “Street” type.

Structure #7 has been given a malus of −10, because, as discussed earlier on, its definition is rather broad, potentially including a substantial number of false positives.

Structures #1 and #5 may be elaborated further by assigning weightings to their respective definition items. For example, structure #1 may contain the definition item “ID” referring to the US state Idaho (not shown). This definition item is preferably given a malus of −5 because the string “ID” is ambiguous. Indeed, “ID” may not only be used in a text as an abbreviation for “Idaho” but also for “Identification”.

Structure #5 may contain the string “Drive” as one of its definition items in order to cover the “street type” “Drive” (not shown). However, this definition item should be given a malus as the string “Drive” may appear in various contexts in a computer text, not necessarily being a synonym for “Street”.
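Weightings on individual definition items can be sketched as a per-item table held next to the weighting of the structure. Several points here are assumptions: the dictionary layout and the function `contribution` are hypothetical, the malus of −5 for “Drive” is an invented value (the text gives a figure only for “ID”), and adding the item weighting to the structure weighting simply follows the summation rule described in the summary.

```python
# Structure weighting plus optional per-item weightings (0 where none is given).
US_STATE_CODE = {
    "weighting": 5,
    "items": {"AL": 0, "AK": 0, "WY": 0, "ID": -5},  # "ID" is ambiguous
}
KNOWN_STREET_TYPE = {
    "weighting": 5,
    # The -5 for "Drive" is an assumed value; the text only says "a malus".
    "items": {"Street": 0, "Boulevard": 0, "Road": 0, "Drive": -5},
}


def contribution(structure, matched_string):
    """Weighting contributed by one match, or None if the string is no item."""
    if matched_string not in structure["items"]:
        return None
    return structure["weighting"] + structure["items"][matched_string]


print(contribution(US_STATE_CODE, "WY"), contribution(US_STATE_CODE, "ID"))
```

With these numbers an ambiguous item such as “ID” cancels out the bonus of its structure, so it no longer helps a provisional match to pass the threshold.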

The pattern identification method of the invention will now be described in detail with reference to FIG. 8, using as an example the structures shown in FIG. 7.

Operation 100 of FIG. 8 corresponds to the creation of a new structure with an associated definition. As an example, operation 100 may involve the definition of the “street address” structure #7 of FIG. 7. Structure #7 is defined as written in FIG. 7.

With operation 102, structure #7 is given a weighting w, namely w=−10 as the structure is rather broad in its definition of what may constitute a street address. Structure #7 having been defined and assigned a weighting, it may then be used by the pattern search engine to search for corresponding patterns in a text (operation 104).

Let us introduce two example texts that are to be searched by the search engine using structure #7:

Text 1:

“Our offices are located at 225 Franklin Street, 02110 MA Boston”

Text 2:

“The boys ate 4 Apple Pies”

With the conventional method using structure #7 without the weighting scheme, the patterns “225 Franklin Street, 02110 MA Boston” in Text 1 and “4 Apple Pies” in Text 2 would each be identified as a “street address”, leading to a false positive in the case of Text 2.

It will now be explained how the use of the inventive weighting scheme suppresses the false positive in Text 2 while detecting the correct pattern in Text 1.

In the inventive method, in the same way as the conventional method, both texts are searched for a match with the definition given by structure #7 (operation 106). If no match is found, the method goes on searching for other patterns using other structures (operation 108). However, if a match is found, as is the case for “225 Franklin Street, 02110 MA Boston” (pattern 1) and “4 Apple Pies” (pattern 2) in the two texts above, it is not immediately validated as was done conventionally. Rather, it is determined which of the definition items of the structure have been found in the identified pattern (operation 110).

Pattern 1 is therefore decomposed as follows:

Number: 225; some spaces; some capitalized words: Franklin; known street type: Street; comma; postal code: 02110 MA; some spaces; city: Boston.

Pattern 2 is decomposed as follows:

Number: 4; some spaces; some capitalized words: Apple; spaces; some spaces; city: Pies.

The next step is to calculate the sum of the weightings of all identified definition items, to which is added the weighting of the structure, giving a total sum of A (operation 112).

In the case of pattern 1, we obtain for A the value of 5 (cf. FIGS. 1 and 7):

-   a bonus of +5 for the presence of a known street type (structure #5), plus
-   a bonus of +5 for the presence of a structure #1 “US state code” within the identified structure #3 “postal code”, plus
-   a bonus of +5 for the presence of a structure “known city” within the structure #6 “city” (assuming that Boston matches the definition of the structure “known city”, which is not shown in the figures), plus
-   a malus of −10 associated with the structure #7 “street address”.

In the case of pattern 2, we obtain for A a value of −10, the value of the malus associated with structure #7, since the elements of the pattern “4 Apple Pies” do not match any of the definition items with a bonus.

In operation 114, A is then compared to a predetermined threshold, here 0. Accordingly, pattern 1 is confirmed since A=5>0 (operation 116), whereas pattern 2 is rejected since A=−10<0 (operation 118).
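Operations 110 to 118 for the two example patterns can be traced with a short sketch. The weightings are those stated for FIG. 7; the item identifiers, the list representation of the decompositions and the function names are hypothetical, and the decomposition itself (operation 110) is written out by hand rather than computed.

```python
STRUCTURE_WEIGHTING = -10  # malus of structure #7 "street address"
ITEM_WEIGHTINGS = {
    "known_street_type": 5,  # structure #5
    "us_state_code": 5,      # structure #1, found inside the postal code
    "known_city": 5,         # found inside structure #6 "city"
}
THRESHOLD = 0

# Definition items identified in each provisional match (operation 110).
PATTERN_1 = ["number", "spaces", "capitalized_words", "known_street_type",
             "comma", "postal_code", "us_state_code", "spaces", "known_city"]
PATTERN_2 = ["number", "spaces", "capitalized_words", "spaces", "city"]


def total_weighting(items):
    """Operation 112: sum the item weightings and add the structure weighting."""
    return STRUCTURE_WEIGHTING + sum(ITEM_WEIGHTINGS.get(item, 0) for item in items)


def is_confirmed(items):
    """Operations 114 to 118: confirm only if A exceeds the threshold."""
    return total_weighting(items) > THRESHOLD


for name, items in (("pattern 1", PATTERN_1), ("pattern 2", PATTERN_2)):
    print(name, total_weighting(items), is_confirmed(items))
```

Definition items without an assigned weighting contribute 0 in this sketch, which is why “4 Apple Pies” ends up with only the malus of the structure.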

Hence, with the inventive weighting scheme, contrary to the prior art, false positives such as “4 Apple Pies” are spotted and discarded. The inventive method therefore renders pattern searching more effective and accurate.

1. A machine-implemented method for identifying patterns in text using structures defining types of patterns which are to be identified, wherein a structure comprises one or more definition items, the method comprising: searching a text data for a pattern to be identified on the basis of a particular structure, a pattern being provisionally identified if it matches the definition given by said particular structure, wherein said structures are automatically extended on the basis of information available from a data source; assigning a weighting to each definition item in the provisionally identified pattern; rejecting or confirming the provisionally identified pattern based upon a calculation which combines the weightings of the definition items in the provisionally identified pattern.
2. The method of claim 1, wherein the calculation creates a single quantity which is compared to a given threshold.
3. The method of claim 2, wherein the calculation is obtained by combining the weightings using one or more arithmetic operations.
4. The method of claim 3, wherein the arithmetic operation is a summation over all weightings, the single quantity being the sum of all the weightings.
5. The method of claim 1, each weighting being defined as a number.
6. The method of claim 5, each weighting taking the form of either a bonus in the form of a positive integer, or a malus in the form of a negative integer.
7. The method of claim 6, wherein a structure or definition item is assigned a bonus if it is well-defined, and a malus if it is ambiguous.
8. The method of claim 5, each weighting being an integer multiple of the same integer.
9. A machine-readable non-transitory storage medium storing a program for causing a data processing system to perform a method for identifying patterns in text using structures defining types of patterns which are to be identified, wherein a structure comprises one or more definition items, the method comprising: searching a text data for a pattern to be identified on the basis of a particular structure, a pattern being provisionally identified if it matches the definition given by said particular structure, wherein said structures are automatically extended on the basis of information available from a data source; assigning a weighting to each definition item in the provisionally identified pattern; rejecting or confirming the provisionally identified pattern based upon a calculation which combines the weightings of the definition items in the provisionally identified pattern.
10. The medium of claim 9, wherein the calculation creates a single quantity which is compared to a given threshold.
11. The medium of claim 10, wherein the calculation is obtained by combining the weightings using one or more arithmetic operations.
12. The medium of claim 11, wherein the arithmetic operation is a summation over all weightings, the single quantity being the sum of all the weightings.
13. The medium of claim 9, each weighting being defined as a number.
14. The medium of claim 13, each weighting taking the form of either a bonus in the form of a positive integer, or a malus in the form of a negative integer.
15. The medium of claim 14, wherein a structure or definition item is assigned a bonus if it is well-defined, and a malus if it is ambiguous.
16. The medium of claim 13, each weighting being an integer multiple of the same integer.
17. The medium of claim 12, wherein the method further comprises indexing the text data for searching.
18. A data processing system which performs a method for identifying patterns in text using structures defining types of patterns which are to be identified, wherein a structure comprises one or more definition items, the data processing system comprising: means for searching a text data for a pattern to be identified on the basis of a particular structure, a pattern being provisionally identified if it matches the definition given by said particular structure, wherein said structures are automatically extended on the basis of information available from a data source; means for assigning a weighting to each definition item in the provisionally identified pattern; means for rejecting or confirming the provisionally identified pattern based upon a calculation which combines the weightings of the definition items in the provisionally identified pattern.
19. The data processing system of claim 18, wherein the calculation creates a single quantity which is compared to a given threshold.
20. The data processing system of claim 19, wherein the calculation is obtained by combining the weightings using one or more arithmetic operations and wherein the arithmetic operation is a summation over all weightings, the single quantity being the sum of all the weightings.